Project Group 10
Dataset Overview:
A collection of bacterial isolates with associated patient demographics, clinical factors and antibiotic susceptibility results.
Aims:
- Determine which clinical or individual variables significantly influence the number of resistances in an infection using linear models
- Explore antibiotic resistance profiles across species using visualisation techniques such as heatmaps
- Perform PCA to identify clustering based on resistance profiles
Data: https://www.kaggle.com/datasets/adilimadeddinehosni/multi-resistance-antibiotic-susceptibility/data
Data Loading, Cleaning & Overview
- library(Rkaggle): Load data directly from Kaggle
- table1() and ggplot()
library(here)
library(table1)
# Load Data
mdr_data <- read.csv(here("data/cleaned_bacteria_data.csv"))
# Create table
t1 <- table1(
  ~ Age + Gender + Diabetes + Hypertension + Hospital_before +
    Infection_Freq +
    AMX.AMP + AMC + CZ + FOX + CTX.CRO + IPM + GEN + AN +
    Acide.nalidixique + ofx + CIP + C + Co.trimoxazole +
    Furanes + colistine
  | Species,
  data = mdr_data
)
t1
A new data frame in long format is built, and variables are created and modified for the downstream analyses.
- tibble(): Creates a lookup table linking antibiotic codes with their full names and pharmacological classes
- pivot_longer() + left_join(): Reshapes the dataset and merges antibiotic information
# MDR variable
# Antibiotic columns start at column 8
antibiotics_cols <- colnames(mdr_data)[8:ncol(mdr_data)]
# Recode every antibiotic column with count_resistance(), a case_when-based
# helper defined in 99_proj_func.R, then sum the recoded columns into MDR
mdr_wide <- mdr_data |>
  mutate(across(all_of(antibiotics_cols), count_resistance)) |>
  mutate(MDR = rowSums(across(all_of(antibiotics_cols)))) |>
  # Drop rows with NA in any column used in the model
  drop_na(Age, Gender, Species, Diabetes, Hypertension, Hospital_before, Infection_Freq, MDR)
...
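The helper count_resistance() and the long-format reshaping are described but not shown on these slides. The following is a minimal sketch of how the two steps could look, not the project's actual code. It assumes that susceptibility results are coded "R", "S" or "I" and that only "R" counts as resistant; the toy data, the lookup entries and the column names antibiotic, result, full_name, drug_class and resistant are all hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for mdr_data: made-up values, two antibiotic columns only
toy <- tibble(
  Species = c("E. coli", "K. pneumoniae", "E. coli"),
  AMC     = c("R", "S", "I"),
  CIP     = c("S", "R", "R")
)
codes <- c("AMC", "CIP")

# Hypothetical lookup table: antibiotic code -> full name and class
antibiotic_lookup <- tibble(
  antibiotic = codes,
  full_name  = c("Amoxicillin-clavulanic acid", "Ciprofloxacin"),
  drug_class = c("Penicillin combination", "Fluoroquinolone")
)

# Assumed behaviour of count_resistance(): 1 for "R", 0 otherwise, NA stays NA
count_resistance <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x == "R" ~ 1,
    TRUE     ~ 0
  )
}

# Wide -> long (one row per isolate and antibiotic), then attach the lookup info
toy_long <- toy |>
  pivot_longer(cols = all_of(codes),
               names_to = "antibiotic",
               values_to = "result") |>
  left_join(antibiotic_lookup, by = "antibiotic") |>
  mutate(resistant = count_resistance(result))
```

Keeping the recoding in one helper means the wide data (for the MDR sum) and the long data (for the heatmap) share the same definition of "resistant".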
#Age-group variable
breaks <- seq(0, 90, 10)
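# right = FALSE below makes the bins left-closed, so they are labelled [a,b); the last bin also includes 90
labels <- paste0("[", head(breaks, -1), ",", tail(breaks, -1), ")")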
mdr_wide <- mdr_wide |>
  mutate(age_group = cut(
    x = Age,
    breaks = breaks,
    right = FALSE,
    include.lowest = TRUE,
    labels = labels))
Analysis of how age, gender, bacterial species, diabetes, hypertension, prior hospitalisation and infection frequency influence the number of antibiotic resistances in a contracted infection using a linear model.
- lm(): Fits the linear model predicting MDR
- broom::tidy(): Extracts coefficients, confidence intervals, and p-values
- mutate(): Cleans variable names and adds a significance flag
- stringr::str_replace(): Standardises variable labels
# Linear model
linear_model <- MDR_df |>
  lm(formula = MDR ~ Age + Gender + Species + Diabetes + Hypertension + Hospital_before + Infection_Freq,
     data = _)
...
#Tidy format
tidy_lm <- tidy(linear_model,
                conf.int = TRUE,
                conf.level = 0.95) |>
  mutate(term = str_replace(term, "Species", "")) |>
  mutate(term = str_replace(term, "Yes", "")) |>
  mutate(term = str_replace(term, "GenderM", "Gender_M")) |>
  mutate(term = str_replace(term, "Freq1", "Freq_1")) |>
  mutate(term = str_replace(term, "Freq2", "Freq_2")) |>
  mutate(term = str_replace(term, "Freq3", "Freq_3")) |>
  mutate(sig = factor(p.value < 0.05))
Only bacterial species significantly affect MDR
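A tidy table with estimates, confidence intervals and a significance flag lends itself to a coefficient plot. The plotting code is not part of these slides, so the following is only a sketch of one way to draw it, fitted on made-up data with two predictors; toy, toy_tidy and coef_plot are hypothetical names.

```r
library(broom)
library(dplyr)
library(ggplot2)

# Made-up data, not the study data
set.seed(1)
toy <- data.frame(
  MDR    = rpois(60, lambda = 4),
  Age    = round(runif(60, 18, 90)),
  Gender = sample(c("F", "M"), 60, replace = TRUE)
)

# Same pattern as above: fit, tidy with 95% intervals, flag p < 0.05
toy_tidy <- lm(MDR ~ Age + Gender, data = toy) |>
  tidy(conf.int = TRUE, conf.level = 0.95) |>
  filter(term != "(Intercept)") |>
  mutate(sig = factor(p.value < 0.05, levels = c(FALSE, TRUE)))

# One point per term with its interval; the dashed line marks "no effect"
coef_plot <- ggplot(toy_tidy, aes(x = estimate, y = term, colour = sig)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "Estimate (95% CI)", y = NULL, colour = "p < 0.05")
```

An interval that crosses the dashed line corresponds to a term whose 95% confidence interval includes zero.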
We also evaluated whether any of these factors is associated with infection frequency:
- Clinical factors show no association with infection frequency
- Infection frequency varies significantly by bacterial species
- The number of drug resistances shows no association with infection frequency
- In the multivariable model, bacterial species is the only meaningful contributor
| Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|
| Patient details | Bacterial species | MDR | Multivariable |
| p = 0.5394 | p = 0.03842 | p = 0.6002 | p = 0.08977 |
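The slides do not show how the p-values in this table were computed. One plausible route, assuming each model is a linear model with infection frequency treated as a numeric outcome, is the overall F-test of the fitted model. The sketch below uses made-up data, and the helper name overall_p is hypothetical.

```r
# Made-up data, not the study data
set.seed(42)
toy <- data.frame(
  Infection_Freq = sample(0:3, 80, replace = TRUE),
  Species = sample(c("E. coli", "K. pneumoniae", "P. mirabilis"), 80, replace = TRUE)
)

# Overall F-test p-value of a fitted lm
# (the p-value printed on the last line of summary(fit))
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

fit_species <- lm(Infection_Freq ~ Species, data = toy)
p_species <- overall_p(fit_species)

# The same test written as a comparison against the intercept-only model
fit_null <- lm(Infection_Freq ~ 1, data = toy)
p_anova <- anova(fit_null, fit_species)[["Pr(>F)"]][2]
```

If infection frequency is instead stored as a categorical variable, a different test would be needed, so this sketch should be read only as an illustration of a per-model p-value.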
Heatmap of resistance patterns across bacterial species, built from the long-format dataset: E. coli and K. pneumoniae show the highest resistance (red), most other species remain largely sensitive (blue), and several antibiotics stay effective across multiple species.
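The heatmap code itself is not included here. A minimal sketch of such a plot, assuming a long-format table with one row per isolate and antibiotic and a 0/1 resistance indicator, could look like this; the toy values and the names toy_long, resistant, heat_df and heatmap_plot are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up long-format data: one row per isolate x antibiotic,
# resistant = 1 for resistant and 0 otherwise
toy_long <- data.frame(
  Species    = rep(c("E. coli", "K. pneumoniae", "P. mirabilis"), each = 4),
  antibiotic = rep(c("AMC", "CIP"), times = 6),
  resistant  = c(1, 1, 1, 0,  1, 1, 0, 1,  0, 0, 0, 1)
)

# Share of resistant isolates per species and antibiotic
heat_df <- toy_long |>
  group_by(Species, antibiotic) |>
  summarise(prop_resistant = mean(resistant), .groups = "drop")

# Blue = mostly sensitive, red = mostly resistant
heatmap_plot <- ggplot(heat_df,
                       aes(x = antibiotic, y = Species, fill = prop_resistant)) +
  geom_tile(colour = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0.5, limits = c(0, 1)) +
  labs(x = "Antibiotic", y = "Species", fill = "Proportion resistant")
```

Plotting proportions rather than raw counts keeps species with many isolates from dominating the colour scale.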
Identify shared patterns of antibiotic resistance across bacterial species.
- select(): Extracts only the antibiotic-resistance columns used for PCA
- prcomp(): Performs the PCA with variable scaling
- augment(): Attaches PCA scores (PC1, PC2…) back to the original dataset
- count(): Determines number of isolates per species
- slice_sample(): Downsamples species to equal sizes for balanced PCA
# Variable selection and PCA
antibiotics_cols <- colnames(MDR_df)[8:(ncol(MDR_df) - 1)]
pca_fit <- MDR_df |>
  select(all_of(antibiotics_cols)) |>
  prcomp(scale. = TRUE)
...
#Augment the original dataset
pca_all_plot <- pca_fit |>
  augment(MDR_df)
...
#Downsampling
min_n <- MDR_df |>
  count(Species) |>
  summarise(min(n)) |>
  pull()
balanced_df <- MDR_df |>
  group_by(Species) |>
  slice_sample(n = min_n) |>
  ungroup()
After balancing species counts, the PCA shows largely shared resistance patterns across taxa, with only a few species, such as P. aeruginosa and S. marcescens, forming distinct clusters separate from the main E. coli-dominated continuum.
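To make the balancing step concrete, the sketch below runs the same downsampling on made-up 0/1 resistance data, refits the PCA on the balanced sample and computes the share of variance explained by each component. The data and the names toy, balanced, pca_balanced and var_explained are hypothetical. Because slice_sample() draws at random, a seed is set so the subsample can be reproduced.

```r
library(dplyr)

set.seed(10)  # slice_sample() is random; the seed makes the draw reproducible

# Made-up stand-in for MDR_df: three species of unequal size, 0/1 indicators
toy <- data.frame(
  Species = rep(c("A", "B", "C"), times = c(30, 12, 8)),
  AMC = rbinom(50, 1, 0.5),
  CIP = rbinom(50, 1, 0.4),
  GEN = rbinom(50, 1, 0.3)
)

# Size of the smallest species group
min_n <- toy |>
  count(Species) |>
  summarise(min_n = min(n)) |>
  pull(min_n)

# Downsample every species to that size
balanced <- toy |>
  group_by(Species) |>
  slice_sample(n = min_n) |>
  ungroup()

# PCA on the balanced sample, with scaled variables
pca_balanced <- balanced |>
  select(AMC, CIP, GEN) |>
  prcomp(scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca_balanced$sdev^2 / sum(pca_balanced$sdev^2)
```

Note that prcomp() with scaling stops with an error if a column is constant in the subsample, so antibiotics with no variation among the retained isolates would have to be dropped first.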